Click-Through Rate Prediction (Kaggle)

In this notebook I'll attempt to build a quick ML model (2-3 hours) to predict whether a mobile ad will be clicked, given a set of parameters. Data comes from the Kaggle competition available at the following link: https://www.kaggle.com/c/avazu-ctr-prediction/overview/evaluation

We have somewhat large test and training files; the zipped training data alone is over 1 GB.

In [5]:
! ls -lh | grep .gz
-rw-r--r--  1 oner  staff   135M 11 Dec 13:48 test.gz
-rw-r--r--@ 1 oner  staff   1.0G 11 Dec 13:53 train.gz

It's a good point to stop and discuss our strategy for processing the files (data engineering) and training our ML models. Since we have a somewhat large dataset, we have the following options:

  • Sample the dataset and use pandas + scikit-learn on a smaller dataset (single core). pdpipe is a good choice for a lightweight data-pipelining option in the pandas domain.
  • Use Dask to do the processing lazily and multiprocess/multinode. This is a good option to get some scalability fairly quickly. Another advantage of Dask is the native integration with Prefect, which can be used for task scheduling and management (great for data pipelines).
  • Move to Spark, which is another great option for scalable data processing and ML (MLflow, MLlib, etc.). However, it involves creating Spark clusters, which might be time consuming; considering this will be a quick task (2-3 hours), I'll skip it for the time being.

Exploratory Data Analysis

Let's have a brief look at the contents of the train.gz and test.gz files.

In [9]:
! gzcat train.gz | head | csvlook
gzcat: error writing to output: Broken pipe
gzcat: train.gz: uncompress failed
|                         id | click |       hour |    C1 | banner_pos | site_id  | site_domain | site_category | app_id   | app_domain | app_category | device_id | device_ip | device_model | device_type | device_conn_type |    C14 | C15 | C16 |   C17 | C18 | C19 |     C20 | C21 |
| -------------------------- | ----- | ---------- | ----- | ---------- | -------- | ----------- | ------------- | -------- | ---------- | ------------ | --------- | --------- | ------------ | ----------- | ---------------- | ------ | --- | --- | ----- | --- | --- | ------- | --- |
|  1,000,009,418,151,094,273 | False | 14,102,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | ddd2926e  | 44956a24     |        True |                2 | 15,706 | 320 |  50 | 1,722 |   0 |  35 |      -1 |  79 |
| 10,000,169,349,117,863,715 | False | 14,102,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 96809ac8  | 711ee120     |        True |                0 | 15,704 | 320 |  50 | 1,722 |   0 |  35 | 100,084 |  79 |
| 10,000,371,904,215,119,486 | False | 14,102,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | b3cf8def  | 8a4875bd     |        True |                0 | 15,704 | 320 |  50 | 1,722 |   0 |  35 | 100,084 |  79 |
| 10,000,640,724,480,838,376 | False | 14,102,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | e8275b8f  | 6332421a     |        True |                0 | 15,706 | 320 |  50 | 1,722 |   0 |  35 | 100,084 |  79 |
| 10,000,679,056,417,042,096 | False | 14,102,100 | 1,005 |       True | fe8cc448 | 9166c161    | 0569f928      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 9644d0bf  | 779d90c2     |        True |                0 | 18,993 | 320 |  50 | 2,161 |   0 |  35 |      -1 | 157 |
| 10,000,720,757,801,103,869 | False | 14,102,100 | 1,005 |      False | d6137915 | bb1ef334    | f028772b      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 05241af0  | 8a4875bd     |        True |                0 | 16,920 | 320 |  50 | 1,899 |   0 | 431 | 100,077 | 117 |
| 10,000,724,729,988,544,911 | False | 14,102,100 | 1,005 |      False | 8fda644b | 25d4cfcd    | f028772b      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | b264c159  | be6db1d7     |        True |                0 | 20,362 | 320 |  50 | 2,333 |   0 |  39 |      -1 | 157 |
| 10,000,918,755,742,328,737 | False | 14,102,100 | 1,005 |       True | e151e245 | 7e091613    | f028772b      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | e6f67278  | be74e6fe     |        True |                0 | 20,632 | 320 |  50 | 2,374 |   3 |  39 |      -1 |  23 |
| 10,000,949,271,186,029,916 |  True | 14,102,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 37e8da74  | 5db079b5     |        True |                2 | 15,707 | 320 |  50 | 1,722 |   0 |  35 |      -1 |  79 |
In [10]:
! gzcat test.gz | head | csvlook
gzcat: error writing to output: Broken pipe
gzcat: test.gz: uncompress failed
|                         id |       hour |    C1 | banner_pos | site_id  | site_domain | site_category | app_id   | app_domain | app_category | device_id | device_ip | device_model | device_type | device_conn_type |    C14 | C15 | C16 |   C17 | C18 | C19 |     C20 | C21 |
| -------------------------- | ---------- | ----- | ---------- | -------- | ----------- | ------------- | -------- | ---------- | ------------ | --------- | --------- | ------------ | ----------- | ---------------- | ------ | --- | --- | ----- | --- | --- | ------- | --- |
| 10,000,174,058,809,263,569 | 14,103,100 | 1,005 |      False | 235ba823 | f6ebf28e    | f028772b      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 69f45779  | 0eb711ec     |        True |                0 |  8,330 | 320 |  50 |   761 |   3 | 175 | 100,075 |  23 |
| 10,000,182,526,920,855,428 | 14,103,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | e8d44657  | ecb851b2     |        True |                0 | 22,676 | 320 |  50 | 2,616 |   0 |  35 | 100,083 |  51 |
| 10,000,554,139,829,213,984 | 14,103,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | 10fb085b  | 1f0bc64f     |        True |                0 | 22,676 | 320 |  50 | 2,616 |   0 |  35 | 100,083 |  51 |
| 10,001,094,637,809,798,845 | 14,103,100 | 1,005 |      False | 85f751fd | c4e18dd6    | 50e219e0      | 51cedd4e | aefc06bd   | 0f2161f8     | a99f214a  | 422d257a  | 542422a7     |        True |                0 | 18,648 | 320 |  50 | 1,092 |   3 | 809 | 100,156 |  61 |
| 10,001,377,041,558,670,745 | 14,103,100 | 1,005 |      False | 85f751fd | c4e18dd6    | 50e219e0      | 9c13b419 | 2347f47a   | f95efa07     | a99f214a  | 078c6b38  | 1f0bc64f     |        True |                0 | 23,160 | 320 |  50 | 2,667 |   0 |  47 |      -1 | 221 |
| 10,001,521,204,153,353,724 | 14,103,100 | 1,005 |       True | 57fe1b20 | 5b626596    | f028772b      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | e75922ff  | 68b6db2c     |        True |                0 |  6,563 | 320 |  50 |   572 |   2 |  39 |      -1 |  32 |
| 10,001,911,056,707,023,378 | 14,103,100 | 1,005 |      False | 1fbe01fe | f3845767    | 28905ebd      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | f1e8683d  | d4897fef     |        True |                0 | 22,813 | 320 |  50 | 2,647 |   2 |  39 | 100,148 |  23 |
| 10,001,982,898,844,213,216 | 14,103,100 | 1,005 |      False | 85f751fd | c4e18dd6    | 50e219e0      | 388d9bfb | 2347f47a   | cef3e649     | 3772665a  | a4a540c1  | a2140f4f     |        True |                3 | 23,214 | 300 | 250 | 2,675 |   3 | 939 | 100,058 | 100 |
| 10,002,000,217,531,288,531 | 14,103,100 | 1,005 |      False | 543a539e | c7ca3108    | 3e814130      | ecad2386 | 7801e8d9   | 07d7df22     | a99f214a  | dc17d849  | ac9ad752     |        True |                0 | 23,642 | 320 |  50 | 2,709 |   3 |  39 |      -1 |  23 |

Looks like we need to predict the click column.

For convenience I'll extract the zipped files, to make them easier for Dask to read.

In [ ]:
! gunzip test.gz && gunzip train.gz
In [24]:
! ls -lh | grep 'test\|train'
-rw-r--r--  1 oner  staff   673M 11 Dec 13:48 test
-rw-r--r--  1 oner  staff   5.9G 11 Dec 13:53 train

As expected, the uncompressed files are about 6x bigger than the compressed versions.

There are a number of columns we can use to build our model, but before going any further, let's decide which ones to use. There are a number of things we could do here, but I'd like to check the distribution of values briefly. My strategy would be:

  • Use pandas-profiling for a quick visualization
  • Since the datasets are too big for pandas, I'll sample them in Dask and get a smaller pandas dataframe to visualize.
In [1]:
# housekeeping
import pandas as pd
import dask.dataframe as dd
from pandas_profiling import ProfileReport

pd.set_option('display.max_columns', 30)

Use dask to read and sample data

  • Create 4 workers
  • Read training data and count rows
  • Sample training data
In [3]:
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1, processes=False, memory_limit='2GB')
client
Out[3]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 8.00 GB
In [4]:
# read training lazily with dask
train = dd.read_csv('train', dtype={'id': 'float64'})

we've got ~ 40.4 million rows of training data

In [5]:
%%time
# sample 0.1% of the rows (whether we sample with or without replacement barely matters on a dataset this big)
s_train = train.sample(frac=0.001, random_state=42).compute()
CPU times: user 1min 54s, sys: 16.5 s, total: 2min 10s
Wall time: 1min 8s
In [6]:
s_train.head()
Out[6]:
id click hour C1 banner_pos site_id site_domain site_category app_id app_domain app_category device_id device_ip device_model device_type device_conn_type C14 C15 C16 C17 C18 C19 C20 C21
109052 8.615899e+18 1 14102100 1005 0 38217daf 449497bc f028772b ecad2386 7801e8d9 07d7df22 a99f214a 925ab50e ac221d6c 1 0 20345 300 250 2331 2 39 -1 23
265550 1.073791e+19 0 14102102 1005 1 5114c672 3f2f3819 3e814130 ecad2386 7801e8d9 07d7df22 a99f214a 7ce8e95a 5ec45883 1 0 19771 320 50 2227 0 687 100075 48
141300 1.268349e+19 0 14102101 1002 0 ab526063 71ae4aea 50e219e0 ecad2386 7801e8d9 07d7df22 184ca3bb 33b0bd88 072c9f1e 0 0 16920 320 50 1899 0 431 -1 117
210894 4.520914e+18 1 14102101 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 07d7df22 a99f214a 8eaef965 d870e4de 1 0 15701 320 50 1722 0 35 100084 79
63661 2.245924e+18 0 14102100 1005 0 85f751fd c4e18dd6 50e219e0 685d1c4c 2347f47a 8ded1f7a 17f6eecd c1a367a4 158e4944 1 3 15704 320 50 1722 0 35 100083 79

Save the sampled dataset to use later in model building:

In [222]:
s_train.to_pickle('data/ctr_data.pickle', compression='gzip')

Evaluate Columns

Let's have a quick look at the columns to decide which ones to use in our ML model.

I'll use pandas-profiling to get a nice view

In [2]:
s_train = pd.read_pickle('data/ctr_data.pickle', compression='gzip')
report = ProfileReport(s_train, title='Train Data Report', explorative=True)
In [3]:
# Save to file for offline viewing
report.to_file('Initial_Report.html')




In [4]:
# view report inside Jupyter Notebook
# NOTE: If you are opening this ipynb file on GitHub the profiling report won't be visible; please download the repo and view the outputted HTML file to see the report
report.to_notebook_iframe()

By looking at the pandas-profiling output we can make a few observations:

  • click has two values (0, 1), as expected, and a click (1) is not a rare event: roughly ~6/33.
  • hour needs conversion, and we can create a few categories out of it. (The format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.)
  • The C1 column has 7 distinct values, though the 3 most common are responsible for more than 99% of rows, so we can filter to those values (1005, 1002 and 1010). C1 seems to be correlated with device_type.
  • banner_pos has 6 values, but (0 and 1) cover 99% of the cases.
  • site_id has high cardinality (1070 site ids), and it's a categorical variable, so it doesn't make much sense to use it (one-hot encoded) to predict clicks. Discarded.
  • site_domain also has high cardinality and doesn't make much sense as a predictor.
  • site_category has 18 unique values and 4 of them are responsible for 98% of the cases: (50e219e0, f028772b, 28905ebd, 3e814130).
  • For app_category, 5 values are responsible for most of the cases: (07d7df22, 0f2161f8, cef3e649, 8ded1f7a, f95efa07).
  • app_id, app_domain, device_id, device_ip, device_model all have high cardinalities. If we knew the raw device_ip we could have enriched it with a geotag, but as encoded values they are not of much use.
  • device_type can take 4 different values, and value 1 accounts for more than 92% of the cases. Probably not a very good predictor; will think about it later.
  • device_conn_type can take 4 values, and (0, 2 and 3) are responsible for 99.9% of the cases.
  • C14 and C17 are highly correlated with each other, so it makes sense to keep only one. However, both look like categorical variables with high cardinality, so I'll drop these columns.
  • C15 takes many values, but (320 and 300) cover more than 99% of the cases.
  • C16 takes many values, but (50 and 250) cover 99% of the cases. Filter for those values.
  • C18 takes 4 values and they all seem to be common (value 1 is the least common but still appears in 7% of the cases). One-hot encode.
  • C19 has high cardinality; skipping.
  • C20 looks like a 6-digit id, such as 100084, but apart from the 6-digit numbers there is a -1 in 46.6% of cases. So we can potentially split this column into -1 vs. others.
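The "top-k values cover X% of the rows" claims above can be spot-checked with a small helper built on value_counts. The helper is my own (not from the notebook's codebase) and is shown on toy data:

```python
import pandas as pd

def top_k_coverage(series: pd.Series, k: int) -> float:
    """Fraction of rows covered by the k most frequent values."""
    # value_counts(normalize=True) returns frequencies sorted descending
    return series.value_counts(normalize=True).head(k).sum()

# Toy example: one dominant value plus a small tail
s = pd.Series([1005] * 90 + [1002] * 8 + [1010, 1001])
print(top_k_coverage(s, 2))  # ~0.98
```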

Based on the analysis above, I'll filter the sampled dataset:

In [5]:
cols_to_drop = ['id', 'site_id', 'site_domain', 'app_id', 'app_domain', 'device_id', 'device_ip', 'device_model', 'device_type', 'C14', 'C17', 'C19', 'C21']
In [6]:
def clean_data(df: pd.DataFrame, cols_to_drop: list) -> pd.DataFrame:
    """
    Define all operations to make training + test data ready for prediction/model building
    """
    # Drop columns deemed unnecessary
    df = df.drop(columns=cols_to_drop)

    # Reset sampled indexes
    df = df.reset_index(drop=True)

    # Convert hour field to datetime
    df['date'] = pd.to_datetime(df['hour'], format='%y%m%d%H')

    # and set the hour field to correct hour (Hour might have some predictive power on click through rate)
    df['hour'] = df.date.dt.hour

    # extract day of the week (maybe on weekends or some specific days people's habit is different)
    df['day'] = df.date.dt.day_name()

    # C1 conditions
    c1_conds = {1005, 1002, 1010}
    df['C1'] = df['C1'].apply(lambda x: x if x in c1_conds else -1)

    # banner_pos conditions
    banner_pos = {0, 1}
    df['banner_pos'] = df['banner_pos'].apply(lambda x: x if x in banner_pos else -1)

    # site category conditions
    site_cats = {'50e219e0', 'f028772b', '28905ebd', '3e814130'}
    df['site_category'] = df['site_category'].apply(lambda x: x if x in site_cats else 'others')
    
    # app_category conditions
    app_cats = {'07d7df22', '0f2161f8', 'cef3e649', '8ded1f7a', 'f95efa07'}
    df['app_category'] = df['app_category'].apply(lambda x: x if x in app_cats else 'others')
    
    # C15 conditions
    c15_conds = {320, 300}
    df['C15'] = df['C15'].apply(lambda x: x if x in c15_conds else -1)
    
    # C16 conditions
    c16_conds = {50, 250}
    df['C16'] = df['C16'].apply(lambda x: x if x in c16_conds else -1)
    
    # C20 conditions
    df['C20'] = df['C20'].apply(lambda x: False if x == -1 else True)
    
    # No need to keep duplicates
    df = df.drop_duplicates().reset_index(drop=True)
    
    return df
In [7]:
cleaned_train = clean_data(s_train, cols_to_drop=cols_to_drop)

Let's run the profiling again to verify our operations

In [8]:
report = ProfileReport(cleaned_train, title='Cleaned Train Data Report', explorative=True)

# first save to disk for offline viewing
report.to_file('Cleaned_Report.html')

# view inside jupyter notebook (If you cannot see any report here, open up 'Cleaned_Report.html' from disk)
report.to_notebook_iframe()




The ML Pipeline

Note: The following code assumes we work on the unzipped (train.gz) and sampled dataset. Since I can spare no more than 3 hours on this assignment, I decided to work on a smaller, memory-fitting dataset for testing and training.

Strategy

  • Clean the data (done in the steps above)
  • One-hot encode columns
  • Split test and training data
  • Build an AdaBoostClassifier. I chose AdaBoost as it by default uses decision trees (which are easy to interpret but prone to overfitting); ensemble methods like AdaBoost deal with the overfitting issue and generally have better predictive capabilities than the base models.
  • Decision trees and AdaBoostClassifier have many hyperparameters, so use GridSearchCV to find the optimum params
  • While fitting the model, optimize for a specific score. Accuracy is the first one that comes to mind, but we'd most likely rather predict the clicks than the non-clicks, so a high-recall model would be better. Train for both.
  • Evaluate the results
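To make the scoring choice concrete: fbeta_score with beta > 1 weights recall more heavily than precision, so a low-recall classifier is penalised harder than under plain F1. A small illustration on toy labels (not the CTR data):

```python
from sklearn.metrics import fbeta_score

# A classifier with perfect precision but poor recall (catches 1 of 3 positives)
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0]

f_half = fbeta_score(y_true, y_pred, beta=0.5)  # precision-leaning
f_one = fbeta_score(y_true, y_pred, beta=1)     # plain F1
f_two = fbeta_score(y_true, y_pred, beta=2)     # recall-leaning

# The more recall matters (larger beta), the worse this model scores
assert f_two < f_one < f_half
```

Passing `make_scorer(fbeta_score, beta=...)` to GridSearchCV, as done below, makes the hyperparameter search optimize for exactly this trade-off.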
In [1]:
import pandas as pd 
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, f1_score

# Helper functions, configs
from beeswax_assignment.wrangling import clean_data
from beeswax_assignment.constants import cols_to_drop, one_hot_encoded_cols, SRC_DATA_PATH, f_BETA


def load_data(ctr_data):
    """ 
    Load ctr_data and create feature and label vectors
    
    Args:
      ctr_data (Path): filepath where sampled data is located
    Returns:
      X (pandas.DataFrame): Dataframe, used as features
      Y (numpy.Array): Numpy array. Used as labels (click column)
      label_names (list): label names 
    """
    
    # read sampled data from disk
    df = pd.read_pickle(ctr_data, compression='gzip')
    
    # clean the data (based on exploratory analysis)
    df = clean_data(df, cols_to_drop=cols_to_drop)
    
    # apply one hot encoding for the categorical columns
    df = pd.get_dummies(df, columns=one_hot_encoded_cols).reset_index(drop=True)
    
    # Target label
    y = df['click'].to_numpy()
    
    # Features
    X = df.drop(['click', 'date'], axis=1)
    
    # feature names
    feature_names = list(X.columns)
    
    return X, y, feature_names


def build_model(scorer):
    """
    Build machine learning pipeline
    
    Args:
      scorer: sklearn.metrics.make_scorer: What score (f-Beta, accuracy..etc) should
                                           we optimize for
    Returns:
      sklearn.model_selection.GridSearchCV: ML model (and pipeline) to build ML model 
                                            using GridSearchCV and AdaBoostClassifier
    """    

    pipeline = Pipeline([
        ('clf', AdaBoostClassifier(random_state=42))
    ])
    
    parameters = {
        'clf__n_estimators': [10, 30, 50, 100, 200],
        'clf__learning_rate': [0.5, 1., 10] 
    }

    # Use GridSearchCV to find optimum hyperparameters for ML model
    model = GridSearchCV(pipeline, param_grid=parameters, scoring=scorer)
    
    return model 

def evaluate_model(model, X_test, y_test):
    """ 
    Evaluate model performance
    
    Args:
      model (sklearn.model_selection.GridSearchCV): Trained ML model
      X_test (pandas.DataFrame): Test feature set
      y_test (pandas.DataFrame): Test label set
    Prints:
      str: classification report and accuracy scores
    Returns:
        None
    """
    
    # Classify the test samples using the trained model
    y_pred = model.predict(X_test)
    
    print("Classification report: \n",classification_report(y_test, y_pred))
    print("Accuracy score: {}".format(accuracy_score(y_test, y_pred)))
    print("f1 score: {}\n\n".format(f1_score(y_test, y_pred, average='macro')))

Run ML Pipeline

In [2]:
# Load data
X, y, feature_names = load_data(SRC_DATA_PATH)

# Split the 'features' and target:'click' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

High Accuracy Model

In [3]:
# Create scorer
scorer_accuracy = make_scorer(accuracy_score)

print('Building high accuracy model...')
model_accuracy = build_model(scorer_accuracy)

print('Training high accuracy model...')
model_accuracy.fit(X_train, y_train)

print('Evaluating high accuracy model...')
evaluate_model(model_accuracy, X_test, y_test)

best_model = model_accuracy.best_estimator_['clf']
print('\nOptimized params for High accuracy model is {}'.format(best_model))

important_features = best_model.feature_importances_
features_sorted = pd.Series(data= important_features, index=feature_names).sort_values(ascending=False).head(10)
print('Top 10 important features for high accuracy model is \n\n{}'.format(features_sorted))
Building high accuracy model...
Training high accuracy model...
Evaluating high accuracy model...
Classification report: 
               precision    recall  f1-score   support

           0       0.73      1.00      0.84      3574
           1       0.60      0.01      0.02      1336

    accuracy                           0.73      4910
   macro avg       0.66      0.50      0.43      4910
weighted avg       0.69      0.73      0.62      4910

Accuracy score: 0.7287169042769858
f1 score: 0.4301633571440521



Optimized params for High accuracy model is AdaBoostClassifier(learning_rate=0.5, n_estimators=30, random_state=42)
Top 10 important features for high accuracy model is 

site_category_others      0.100000
C16_250                   0.100000
C18_1                     0.066667
device_conn_type_2        0.066667
app_category_f95efa07     0.066667
site_category_28905ebd    0.066667
site_category_3e814130    0.066667
C15_-1                    0.033333
device_conn_type_5        0.033333
device_conn_type_0        0.033333
dtype: float64

High Recall Model

In [5]:
scorer_fbeta = make_scorer(fbeta_score, beta=f_BETA)

print('Building high recall model...')
model_beta = build_model(scorer_fbeta)

print('Training high recall model...')
model_beta.fit(X_train, y_train)

print('Evaluating high recall model...')
evaluate_model(model_beta, X_test, y_test)

best_model = model_beta.best_estimator_['clf']
print('\nOptimized params for High recall model is {}'.format(best_model))

important_features = best_model.feature_importances_
features_sorted = pd.Series(data= important_features, index=feature_names).sort_values(ascending=False).head(10)
print('Top 10 important features for high recall model is \n\n{}'.format(features_sorted))
Building high recall model...
Training high recall model...
Evaluating high recall model...
Classification report: 
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      3574
           1       0.27      1.00      0.43      1336

    accuracy                           0.27      4910
   macro avg       0.14      0.50      0.21      4910
weighted avg       0.07      0.27      0.12      4910

Accuracy score: 0.2720977596741344
f1 score: 0.2138968940121678



Optimized params for High recall model is AdaBoostClassifier(learning_rate=10, n_estimators=10, random_state=42)
Top 10 important features for high recall model is 

C18_2      0.2
C18_3      0.0
C18_1      0.0
C18_0      0.0
C16_250    0.0
C16_50     0.0
C16_-1     0.0
C15_320    0.0
C15_300    0.0
C15_-1     0.0
dtype: float64
/Users/oner/.virtualenvs/generic3.8/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Though this was a very quick model-building exercise, many aspects were disregarded here since I was restricted to 3 hours to complete this assignment.

  • Apart from the hour-to-weekday conversion we haven't done any feature engineering; potentially we could do much more.
  • PCA analysis would be useful to narrow down the important features.
  • Dataset labels are not always very helpful; clearer labels would help us decide which columns to consider, drop, or further feature-engineer.
  • I simply chose one supervised learning algorithm with a fine balance between interpretability and accuracy. We could have considered other supervised learning algorithms, or deep learning models (less interpretable, though feature selection becomes much easier in DNNs).
  • Supervised ML algorithms like logistic regression or naive Bayes would give us probabilities, which are easier to tune against a ROC curve. That would be good to explore.
  • There is no real data pipeline in this notebook due to the time restriction, but in the real world I would use tools/libraries like Apache Airflow, Prefect, or Luigi to stitch the steps together.
  • Since this is just a notebook I used print statements in many places; in a prod setting I would use logging instead.
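On the probability point: a minimal sketch of what threshold tuning with predict_proba looks like, using LogisticRegression on synthetic data (make_classification and the 0.3 threshold are illustrative choices, not from this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary-classification data standing in for the CTR features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probabilities for the positive class, not just hard 0/1 labels
proba = clf.predict_proba(X)[:, 1]
print('ROC AUC:', roc_auc_score(y, proba))

# Lowering the decision threshold below 0.5 trades precision for recall
high_recall_preds = (proba >= 0.3).astype(int)
```

Sweeping that threshold and reading precision/recall off the ROC (or precision-recall) curve is what the hard-label AdaBoost setup above makes awkward.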
In [ ]: